San Francisco rents have been rising steadily since 2000. From 2000 to 2010, 28,500 newcomers arrived in the city, and from 2010 to 2012 an additional 20,600 people moved in, in just three years. The population growth is commonly attributed to the booming technology sector.
While demand for housing increased, supply remained relatively fixed due to zoning laws. The most fundamental rule of economics dictates that prices rise when demand is high and supply is low. I was interested in how crime affects prices in certain neighborhoods and whether there is a relationship between price and crime.
To that end, I analyzed trends in reported crime and rent hikes across neighborhoods and summarized them with exploratory graphs. The goals: predict San Francisco housing prices using district, year, bedrooms, and aggregate crime counts per district per year, and use regression, CART, and random forests to predict rental prices from variables such as categorized crime counts, bedrooms, and year.
import pandas as pd
import numpy as np
import shapely
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
The project is an in-depth analysis of changes in rental and crime rates in San Francisco from 2003 to 2015. Data was retrieved from a multitude of sources, chiefly Eric Fischer's GitHub repo of aggregated Craigslist rental data archived from the Wayback Machine, and the SFPD incident dataset from SF OpenData.
The original Craigslist data appears as below, and I needed to parse out specific attributes: price, bedrooms, and district. This was performed through a variety of data-structure methods and regular expressions. The issue with Craigslist data is that posts are inconsistent in both structure and district naming; to counter this, I scored each extracted name against a dictionary of San Francisco districts to standardize my data.
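As a minimal sketch of that parsing step (the post text, district list, and score cutoff here are all hypothetical, and the sketch uses Python 3 syntax; real posts vary far more):

```python
import re
import difflib

# Hypothetical Craigslist title; real posts are far less consistent.
post = "$2850 / 1br - Sunny apartment in the Missionn District"

# Extract price and bedroom count with regular expressions.
price = int(re.search(r'\$(\d+)', post).group(1))
beds = int(re.search(r'(\d+)\s*br', post, re.IGNORECASE).group(1))

# Standardize the (often misspelled) district name by scoring it
# against a dictionary of known San Francisco districts.
districts = ['mission district', 'south of market', 'tenderloin', 'marina']
raw_district = ' '.join(post.lower().split()[-2:])   # 'missionn district'
match = difflib.get_close_matches(raw_district, districts, n=1, cutoff=0.6)

print(price, beds, match)
```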

Variables were obtained from the Craigslist data, and district center locations were scraped from Google. The remaining variables were present in the crime dataset. I engineered a few binary features based on whether a crime was violent, alcohol-related, organized, narcotic-related, or city-related.
#crime = pd.read_csv('./train.csv') # read in crime
crime = pd.read_csv('./SFPD_Incidents_-_from_1_January_2003.csv')
crime_sub = crime.sample(100000) #subset sample
crime_sub.head(1)
district_coordinates = pd.read_csv('./district_coordinates.csv') # read in coordinates
del district_coordinates['Unnamed: 0']
district_coordinates = district_coordinates.dropna()
district_coordinates.head() # 4 coordinate places is an accuracy of 11 meters
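A quick sanity check on that comment: one degree of latitude spans roughly 1/360 of Earth's circumference, so the fourth decimal place of a coordinate resolves about 11 meters:

```python
# One degree of latitude spans ~1/360 of Earth's circumference (~40,075 km),
# so the fourth decimal place of a coordinate resolves about 11 meters.
meters_per_degree = 40075000 / 360          # ~111,319 m per degree
precision_4dp = meters_per_degree * 1e-4    # resolution of 4 decimal places
print(round(precision_4dp, 1))              # ~11.1 meters
```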
crime_sub.info()
crime_sub.head()
posts_crimes.head()
Below is a series of plots: crime per year for all of SF, median price per year for all of SF, and crime/price for districts chosen to represent as much of San Francisco's neighborhoods and population as possible. Finally, there is a series of plots of crimes of interest from the engineered dummy variables based on whether a crime was violent, alcohol-related, organized, narcotic, or city-related.

sns.set(rc={"figure.figsize": (10, 10)})
sns.set_style('white')
print crime_sub.Category.value_counts()[0:10]
crime_sub.Category.value_counts().plot(kind = 'barh')
sns.set(rc={"figure.figsize": (10, 10)})
sns.set_style('white')
price_by_hood_pre = posts_crimes[posts_crimes.bed == 1]
price_by_hood = pd.DataFrame(price_by_hood_pre.groupby(['district1','year']).median()['price']).reset_index()
price_by_hood[price_by_hood.year == (2003)].groupby(['district1','year']).median().sort_values('price').plot(kind = 'barh')
price_by_hood[price_by_hood.year == (2015)].groupby(['district1','year']).median().sort_values('price').plot(kind = 'barh')
The crime dataset uses different districts of San Francisco (police districts) than the Craigslist data. This created a problem, since there were no common columns to perform a merge on. To solve this, I scraped coordinates for each Craigslist district center and wrote a function that finds the district center closest to each crime's location. The function uses the haversine distance as a scoring metric, building a list of distances from each district center to the crime location; the district with the lowest distance is assumed to be the district the crime occurred in.
# get decimals as a feature X1 X2 from the location coordinates
from decimal import Decimal

def replace_lon(location):
    lon = str(location).replace('(','').replace(')','').split(',')
    lon = np.float64(lon[1])
    return lon

def replace_lat(location):
    lat = str(location).replace('(','').replace(')','').split(',')
    lat = np.float64(lat[0])
    return lat

crime_sub['X1'] = crime_sub.apply(lambda row: replace_lon(row['Location']), axis=1)
crime_sub['Y1'] = crime_sub.apply(lambda row: replace_lat(row['Location']), axis=1)
# haversine distances to check which crime occurred in what district
from haversine import haversine

def check_distance(location):
    haversine_list = []
    for d, lat, lon in zip(district_coordinates.district, district_coordinates.lat, district_coordinates.lon):
        haversine_list.append((haversine(location, (lon, lat)), d))
    match = sorted(haversine_list)[0][1]
    return match

crime_sub['match_district'] = crime_sub.apply(lambda row: check_distance((row['X1'], row['Y1'])), axis=1)
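A self-contained sketch of the same nearest-center matching, with a pure-Python haversine in case the `haversine` package is unavailable. The district centers below are illustrative, not the scraped values; note that the `haversine` package expects `(lat, lon)` tuples.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(p1, p2):
    # Great-circle distance between two (lat, lon) points, in kilometers.
    lat1, lon1, lat2, lon2 = map(radians, (*p1, *p2))
    a = sin((lat2 - lat1) / 2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)**2
    return 2 * 6371 * asin(sqrt(a))

# Toy district centers (lat, lon) -- illustrative, not the scraped values.
centers = {
    'mission district': (37.7599, -122.4148),
    'marina': (37.8037, -122.4368),
    'tenderloin': (37.7847, -122.4145),
}

def nearest_district(location):
    # Score the location against every district center; the smallest
    # haversine distance is assumed to be the crime's district.
    return min(centers, key=lambda d: haversine_km(location, centers[d]))

print(nearest_district((37.7610, -122.4180)))  # a point near the Mission
```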
Create sub crime categories based on crimes that could be considered under the same category.
# create crime Categories
crime_sub.Category.value_counts()
theft = ['LARCENY/THEFT','VEHICLE THEFT','BURGLARY','ROBBERY','STOLEN PROPERTY','RECOVERED VEHICLE']
drunk = ['DRUG/NARCOTIC','DRUNKENNESS','DRIVING UNDER THE INFLUENCE','LIQUOR LAWS']
organized = ['BRIBERY','EXTORTION','PROSTITUTION']
violent = ['ASSAULT','SEX OFFENSES, FORCIBLE','KIDNAPPING']
narcotic = ['DRUG/NARCOTIC']
city = ['VANDALISM','LOITERING']
crime_sub['is_theft'] = crime_sub.Category.apply(lambda x: 1 if x in theft else 0)
crime_sub['is_drunk'] = crime_sub.Category.apply(lambda x: 1 if x in drunk else 0)
crime_sub['is_organized'] = crime_sub.Category.apply(lambda x: 1 if x in organized else 0)
crime_sub['is_violent'] = crime_sub.Category.apply(lambda x: 1 if x in violent else 0)
crime_sub['is_narcotic'] = crime_sub.Category.apply(lambda x: 1 if x in narcotic else 0)
crime_sub['is_city'] = crime_sub.Category.apply(lambda x: 1 if x in city else 0)
Use pandas datetime to convert the Date column into the index, then use that information to create a yearly feature for the crimes.
from datetime import datetime as dt
print crime_sub.index.dtype
crime_sub['Date'] = pd.to_datetime(crime_sub['Date'])
crime_sub.set_index('Date',inplace= True)
print crime_sub.index.dtype
# create a year function, 2016 is not completely represented
crime_sub['year'] = crime_sub.index.map(lambda val: val.year)
crime_sub.head(1)
crime_sub = crime_sub[crime_sub['year'] < 2016]
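Equivalently, pandas can extract the year without a Python-level map via the vectorized `.dt` accessor (toy data standing in for `crime_sub`; Python 3 syntax):

```python
import pandas as pd

# Toy frame standing in for crime_sub; the .dt accessor (or .year on a
# DatetimeIndex) extracts years without a Python-level map.
df = pd.DataFrame({'Date': ['2003-05-01', '2015-12-31', '2016-01-02']})
df['Date'] = pd.to_datetime(df['Date'])
df['year'] = df['Date'].dt.year
df = df[df['year'] < 2016]   # 2016 is only partially represented
print(df['year'].tolist())
```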
posts = pd.read_csv('./craigslist_posts.csv')
posts.head()
posts = posts[posts.bed < 6]
Use a series of transformations to merge two separate data frames, Craigslist posts and crimes, combining aggregate counts of crime categories per district per year with the rental listing information.
crime_groups = pd.DataFrame(crime_sub.groupby(['year','Category','match_district']).size())
crime_wide = crime_groups.reset_index()
crime_wide.head()
crime_wide.columns = ['year','crime','district','crime_count']
crime_wide.head()
print [x for x in crime_wide.district.values if not type(x) == str]
crime_wide = crime_wide[[True if type(x) == str else False for x in crime_wide.district.values]]
crime_wide.district = crime_wide.district.map(lambda x: x.lower())
crime_wide_ = pd.pivot_table(crime_wide, index=['year','crime'], columns=['district'], fill_value=0)['crime_count'].reset_index()
crime_wide_.head()
crime_trans = crime_wide.T
posts = posts.iloc[:,1:]
posts.head()
crime_trans.columns = np.ravel(crime_trans.ix['crime',:].values)
crime_trans.head()
small_posts = posts
small_posts.shape
small_posts.year.value_counts()
# Transformations to aggregate counts
new_dfs = []
for i, row in small_posts.iterrows():
    ndf = pd.DataFrame({
        'post': [row.post],
        'year': [row.year],
        'price': [row.price],
        'bed': [row.bed],
        'district1': [row.district1],
        'district2': [row.district2],
        'district3': [row.district3]
    })
    ndf.index = [i]
    d1_mask_colmask = (crime_trans.loc['year',:] == row.year) & (crime_trans.loc['district',:] == row.district1)
    d2_mask_colmask = (crime_trans.loc['year',:] == row.year) & (crime_trans.loc['district',:] == row.district2)
    d3_mask_colmask = (crime_trans.loc['year',:] == row.year) & (crime_trans.loc['district',:] == row.district3)
    district1_crimes = crime_trans.ix[['crime_count'], d1_mask_colmask]
    district2_crimes = crime_trans.ix[['crime_count'], d2_mask_colmask]
    district3_crimes = crime_trans.ix[['crime_count'], d3_mask_colmask]
    district1_crimes.columns = ['D1_'+x for x in district1_crimes.columns]
    district2_crimes.columns = ['D2_'+x for x in district2_crimes.columns]
    district3_crimes.columns = ['D3_'+x for x in district3_crimes.columns]
    dist_crimes = pd.concat([district1_crimes, district2_crimes, district3_crimes],
                            axis=1, ignore_index=False)
    dist_crimes.index = [i]
    ndf = pd.concat([ndf, dist_crimes], axis=1, ignore_index=False)
    new_dfs.append(ndf)
posts_crimes = pd.concat(new_dfs, axis=0, ignore_index=False).fillna(value=0, axis=1)
posts_crimes.head()
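For reference, the per-district lookup the loop performs could also be sketched as a single `pd.merge` on year and district. The frames and column names below are toy stand-ins for the real aggregates, not the actual data:

```python
import pandas as pd

# Toy stand-ins: listings and per-district, per-year crime counts.
posts_toy = pd.DataFrame({
    'year': [2003, 2004],
    'district1': ['mission district', 'marina'],
    'price': [1500, 2100],
})
crime_counts = pd.DataFrame({
    'year': [2003, 2004],
    'district': ['mission district', 'marina'],
    'D1_LARCENY/THEFT': [120, 45],
})

# One left merge replaces the row-by-row lookup for the primary district.
merged = posts_toy.merge(
    crime_counts, left_on=['year', 'district1'],
    right_on=['year', 'district'], how='left',
).drop(columns='district')
print(merged['D1_LARCENY/THEFT'].tolist())
```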
from sklearn.preprocessing import LabelEncoder
posts_crimes.district1 = posts_crimes.district1.map(lambda x: 'none' if x == 0 else x)
posts_crimes.district2 = posts_crimes.district2.map(lambda x: 'none' if x == 0 else x)
posts_crimes.district3 = posts_crimes.district3.map(lambda x: 'none' if x == 0 else x)
label_encoder = LabelEncoder()
dists = posts_crimes.district1.values.tolist() + posts_crimes.district2.values.tolist() + posts_crimes.district3.values.tolist()
dists = np.unique(dists)
# print dists
# print posts_crimes.district1.unique()
# print posts_crimes.district2.unique()
# print posts_crimes.district3.unique()
label_encoder.fit(dists)
posts_crimes['district1_code'] = label_encoder.transform(posts_crimes.district1.values)
posts_crimes['district2_code'] = label_encoder.transform(posts_crimes.district2.values)
posts_crimes['district3_code'] = label_encoder.transform(posts_crimes.district3.values)
posts_crimes.to_csv('./aggregate_df.csv')
posts_crimes = pd.read_csv('./aggregate_df.csv')
del posts_crimes['Unnamed: 0']
posts_crimes.head(1)
posts_crimes = posts_crimes[posts_crimes.price < 50000]
target = posts_crimes.price
tmp = posts_crimes[[col for col in posts_crimes if posts_crimes[col].dtype != object or col == 'district1']]
print posts_crimes.post.dtype
tmp.head()
bigml = [c for c in tmp if 'D1' in c ]+['bed','year','district1_code','district1','price']
bigml = tmp[bigml]
bigml.to_csv('bigml.csv')
#pd.read_csv('./bigml.csv')
D1 = [c for c in tmp if 'D1' in c ]+['bed','year','district1_code']
tmp[D1].head()
district_codes_df = tmp[D1]['district1_code']
district_codes_df.head()
The goal of this analysis was to use the ~40 crime-category counts, number of bedrooms, year, and the district code as a numerical variable to predict rental price. I first used linear regression, which achieved an R² of .43. Given the number of features, I expected a decision tree with cross-validation (train/test split, min sample splits of 2, 5, 10) to perform better. Grid search found the best DecisionTreeRegressor (.57) at a depth of 9. Next, I used feature importances to select the highest performers and re-ran the best tree on only the selected features; it scored .60, with the most important features being bed, year, district, and larceny. Finally, random forests improved the score to .62 with the selected features.
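The cells below use the scikit-learn module paths of the time (`sklearn.cross_validation`, `sklearn.grid_search`); in current releases both live in `sklearn.model_selection`. A sketch of the same grid-search workflow with the modern imports, on synthetic data rather than the crime features:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the crime/rent features.
X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

# Same search space as below: depths 3-9, min sample splits of 2, 5, 10.
param_grid = {'max_depth': np.arange(3, 10), 'min_samples_split': [2, 5, 10]}
search = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid,
                      cv=10, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, round(search.score(X_test, y_test), 2))
```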
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import train_test_split
X = tmp[D1]
y = target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
lm = LinearRegression()
model = lm.fit(X_train, y_train)
y_pred = lm.predict(X_test)
print model.score(X_test, y_test) # R^2 on held-out data
from sklearn.tree import DecisionTreeRegressor
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(tmp[D1], target, test_size=0.33)
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)
param_grid = {'max_depth': np.arange(3, 10),'min_samples_split':[2,5,10]}
tree = GridSearchCV(DecisionTreeRegressor(criterion='mse'), param_grid, cv = 10,n_jobs = -1)
tree.fit(X_train, y_train)
tree_preds = tree.predict(X_test)
print X_test.shape,y_test.shape
tree_performance = tree.score(X_test, y_test)
print tree_preds
print tree_performance
print tree.best_params_
print tree.grid_scores_
print tree.best_estimator_
print tree.best_score_
# best fit tree regressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(tmp[D1], target, test_size=0.33)
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)
tree = DecisionTreeRegressor(criterion='mse', max_depth = 9, min_samples_split = 5)
tree.fit(X_train, y_train)
tree_preds = tree.predict(X_test)
print X_test.shape,y_test.shape
tree_performance = tree.score(X_test, y_test)
print tree_preds
print 'performance r2:',tree_performance
print tree.feature_importances_
features_df = pd.DataFrame([col for col in tmp[D1]],[i for i in tree.feature_importances_]).reset_index()
sns.set(rc={"figure.figsize": (10,10)})
sns.set_style('white')
features_df.rename(columns = {'index':'score',0:'features'},inplace = True)
features_df.set_index('features').score.sort_values().plot(kind = 'barh')
# 'year','bed','weapon laws',
sns.set(rc={"figure.figsize": (10, 10)})
sns.set_style('white')
sns.regplot(y_test,tree_preds)
selected_features = [i for i in features_df[features_df.score > .01].features]
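The manual importance threshold above (> .01) is the same idea scikit-learn packages as `SelectFromModel`; a sketch on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# SelectFromModel wraps the manual thresholding step: keep only features
# whose fitted importance exceeds the given cutoff (here .01).
X, y = make_regression(n_samples=300, n_features=8, n_informative=3,
                       random_state=0)
selector = SelectFromModel(
    RandomForestRegressor(n_estimators=50, random_state=0), threshold=0.01)
X_selected = selector.fit_transform(X, y)
print(X.shape[1], '->', X_selected.shape[1])
```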
# best fit tree regressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(tmp[D1][selected_features], target, test_size=0.33)
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)
tree = DecisionTreeRegressor(criterion='mse', max_depth = 9, min_samples_split = 5)
tree.fit(X_train, y_train)
tree_preds = tree.predict(X_test)
print X_test.shape,y_test.shape
print target[0:3]
tree_performance = tree.score(X_test, y_test)
print tree_preds
print 'performance r2:',tree_performance
print tree.feature_importances_
#eatures_df = pd.DataFrame([col for col in tmp[D1]],[i for i in tree.feature_importances_]).reset_index()
from sklearn.ensemble import RandomForestRegressor
from sklearn.cross_validation import cross_val_score
forest = RandomForestRegressor(random_state=0, n_estimators=1000, min_samples_split = 5)
X_train, X_test, y_train, y_test = train_test_split(tmp[D1], target, test_size=0.33)
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)
forest.fit(X_train, y_train)
forest_preds = forest.predict(X_test)
# forest_preds = cross_val_score(estimator = forest, X_test, y= y_test, scoring=None, cv=5, n_jobs=-1, verbose=0, fit_params=None)
forest_performance = forest.score(X_test, y_test)
print forest_preds
print 'forest r2:',forest_performance
print forest
print forest.get_params
print 'performance', forest_performance
sns.set(rc={"figure.figsize": (10, 10)})
sns.set_style('white')
sns.regplot(y_test,forest_preds)
# random forest selected features
from sklearn.ensemble import RandomForestRegressor
from sklearn.cross_validation import cross_val_score
forest = RandomForestRegressor(random_state=0, n_estimators=1000, min_samples_split = 5)
X_train, X_test, y_train, y_test = train_test_split(tmp[D1][selected_features], target, test_size=0.33)
y_train = np.ravel(y_train)
y_test = np.ravel(y_test)
forest.fit(X_train, y_train)
forest_preds = forest.predict(X_test)
# forest_preds = cross_val_score(estimator = forest, X_test, y= y_test, scoring=None, cv=5, n_jobs=-1, verbose=0, fit_params=None)
forest_performance = forest.score(X_test, y_test)
print forest_preds
print 'forest r2:',forest_performance
print forest
print forest.get_params
def prediction_plot(pred_array, title, x_limit, actual):
    fig = plt.figure(figsize=(12,10))
    frame1 = fig.add_axes((.1,.3,.8,.6))
    mean_ = np.mean(pred_array)
    min_ = np.min(pred_array)
    max_ = np.max(pred_array)
    ax = fig.gca()
    ax.plot(range(len(pred_array)), pred_array,
            alpha=0.5, c='darkred', label='predicted Rental Price')
    ax.scatter(range(len(actual)), actual, s=35,
               alpha=0.7, c='blue', label='actual Rental Price')
    zero = ax.axhline(0, lw=1.5, c='black')
    mean_line = ax.axhline(mean_, ls='-', lw=1., c='black', alpha=0.5,
                           label='Mean predicted val: %0.2f' % mean_)
    max_line = ax.axhline(max_, ls='--', lw=1., c='blue', alpha=0.7,
                          label='Max predicted val: %0.2f' % max_)
    min_line = ax.axhline(min_, ls='--', lw=1., c='green', alpha=0.7,
                          label='Min predicted val: %0.2f' % min_)
    ax.legend(fontsize=16, bbox_to_anchor=(1.5,1.))
    ax.set_xlim([0, x_limit])
    ax.set_ylim([100, 7000])
    ax.set_ylabel('Rental Price', fontsize=16)
    ax.set_title(title, fontsize=17)
    # Residual plot box
    residual_error = actual - pred_array
    residual_error = residual_error.tolist()
    frame2 = fig.add_axes((.1,.1,.8,.2))
    frame2.scatter(range(len(residual_error)), residual_error, c='darkred', label='Residual errors')
    for i in range(len(residual_error)):
        frame2.plot((i,i), (0, residual_error[i]), ls='--', c='green')
    frame2.axhline(0, lw=1.5, c='black', ls='--')
    frame2.set_ylim([-500,500])
    frame2.set_xlim([0, x_limit])
    frame2.legend(bbox_to_anchor=(1.25,1.0), fontsize=15)
    plt.show()
def plot_list(plot_function, lists, titles, x_limit, actual):
    for i, a in enumerate(lists):
        plot_function(a, titles[i], x_limit, actual)

plot_list(prediction_plot, [forest_preds], ['Random Forest'], 50, y_test)
from sklearn.tree import DecisionTreeRegressor
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
features = {}
tmp = posts_crimes[[col for col in posts_crimes if posts_crimes[col].dtype != object or col == 'district1']]
D1 = [c for c in tmp if 'D1' in c ]+['bed','year','district1','price']
group_forest_df = tmp[D1]
for district, group_df in tmp[D1].groupby("district1"):
    try:
        target = group_df.price
        predictors = group_df[[col for col in group_df if col not in ['price','district1']]]
        X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.33)
        y_train = np.ravel(y_train)
        y_test = np.ravel(y_test)
        model = RandomForestRegressor()
        forest = model.fit(X_train, y_train)
        forest_performance = model.score(X_test, y_test)
        print district, forest_performance
        features[district] = forest.feature_importances_
    except:
        continue
features_df = pd.DataFrame(features).T
features_df.columns = [i for i in predictors.columns]
features_df = features_df.T
features_df.head()
I saw price and theft differences across neighborhoods and suspected something more complex was happening, so I looked at how different neighborhoods behave under the model. The feature importances differ, which suggests something interesting is going on: SOMA has a very high feature importance for larceny/theft while no other neighborhood does. In addition, running the random forest model for each individual district showed huge variance in the score, from about .8 for USF down to -.32 for Lakeshore, which means the distribution of counts in each district varies widely.
for index, i in enumerate(['mission district','south of market','marina','russian hill','north beach','tenderloin']):
    plt.figure(index)
    features_df_sub = features_df[i].sort_values()
    sns.set_style('white')
    features_df_sub.plot(kind='barh', title=i + ' feature_importances')
    print i
map_choro = pd.read_csv('/Users/Stav/Downloads/new_geoms_2.csv')
from branca.colormap import linear
colormap = linear.YlOrRd.scale(1000, 3500)
# map_choro.price.max())
price_dict = {}
for k, v in zip(map_choro['district1'], map_choro['price']):
    price_dict[k] = v
color_dict = {key: colormap(price_dict[key]) for key in price_dict.keys()}
color_dict
The differing feature importances show that neighborhood attributes, such as counts of different crime types, affect pricing differently. From my analysis, we can say that crime is not directly related to pricing; rather, pricing may affect which types of crimes take place where.
from IPython.display import HTML
HTML('<iframe width="100%" height="520" frameborder="0" src="https://stavgrossfeld.cartodb.com/viz/f411c652-3cdf-11e6-8073-0ecfd53eb7d3/embed_map" \
allowfullscreen webkitallowfullscreen mozallowfullscreen oallowfullscreen msallowfullscreen></iframe>')
from folium.map import Popup, Icon, Marker
import folium
from folium import plugins
crime_sub_map = crime_sub
crime_sub_map['Date'] = crime_sub.index
#Cartodb Dark_matter
m = folium.Map(location=[37.7599,-122.431297],tiles='Cartodb Dark_matter', zoom_start = 12, min_zoom = 12, max_zoom = 15)
folium.GeoJson(open('/Users/Stav/Downloads/new_geoms_2.geojson'),style_function=lambda feature:{'fillColor': color_dict[feature['properties']['district1']],'fillOpacity':'.5','color' : 'white','weight':'3','dashArray':'5,5'}).add_to(m)
marker_cluster = folium.MarkerCluster("Theft").add_to(m)
for each in crime_sub_map[crime_sub_map['is_theft']==1][0:1000].iterrows():
    #print each[1]['Category'], each[1]['DayOfWeek'], each['Time']
    popup = str('crime: ' + each[1]['Category'] + '\n' + 'date: ' + str(each[1]['Date']).split()[0] + '\n' + 'day: ' + each[1]['DayOfWeek'] + '\n' + 'time: ' + each[1]['Time'])
    folium.Marker(location=[each[1]['Y'], each[1]['X']], popup=popup).add_to(marker_cluster)
print colormap
m
# display(HTML("""
# <div style="position: absolute">
# <div id='legend-form'>
# <h4>rental price</h4>
# <select name="something">
# <option>price</option>
# <option>crime</option>
# </select>
# </div>
# </div>
# """))
colormap
posts_crimes.head()
from folium.map import Popup, Icon, Marker
import folium
from folium import plugins
from IPython.display import clear_output
from ipywidgets import *
from IPython.display import display
hello = [i for i in crime_sub.Category.unique()]
from ipywidgets import Dropdown
crime = Dropdown(
options= hello,
value = hello[0],
description='Crimes: ',
)
submit_button = Button(description="Submit Info")
submit_button.on_click(test) # test, the mapping function defined below, must exist before this cell runs
display(crime)
display(submit_button)
def test(button):
    # callback for submit_button; on_click passes the clicked widget itself
    clear_output()
    print crime.value
    crime_select = crime_sub_map[crime_sub_map.Category == crime.value]
    select_map = folium.Map(location=[37.7599,-122.431297], tiles='Cartodb Dark_matter', zoom_start=12, min_zoom=12, max_zoom=15)
    folium.GeoJson(open('/Users/Stav/Downloads/new_geoms_2.geojson'), style_function=lambda feature: {'fillColor': color_dict[feature['properties']['district1']], 'fillOpacity': '.5', 'color': 'white', 'weight': '3', 'dashArray': '5,5'}).add_to(select_map)
    marker_cluster = folium.MarkerCluster("select").add_to(select_map)
    for each in crime_select[0:1000].iterrows():
        #print each[1]['Category'], each[1]['DayOfWeek'], each['Time']
        popup = str('crime: ' + each[1]['Category'] + '\n' + 'date: ' + str(each[1]['Date']).split()[0] + '\n' + 'day: ' + each[1]['DayOfWeek'] + '\n' + 'time: ' + each[1]['Time'])
        folium.Marker(location=[each[1]['Y'], each[1]['X']], popup=popup).add_to(marker_cluster)
    clear_output()
    display(select_map)
    display(colormap)
# after they submit destination.values